This experiment is a full differential expression analysis of PARAM.TYPE.OF.SAMPLE samples to find the changes in gene expression caused associated with PARAM.SAMPLE.ASSOCIATED.WITH.
The analysis protocol is based on the sequential execution of the following steps:
Get a matrix with the raw counts from the set of samples we want to analyse.
We are working with 53 samples. We might need more samples to obtain significant results.
We monitor some QC aspects and normalise the data according to several criteria.
# Field of View (FOV) Plots
plotFOV(eset = eset, metadata = metadata, fov.threshold = opt$fov,
comparison.key = key.label, legend.label = opt$label)
Due to the nature of the nCounter technology, analysis of some samples may produce too many or too few probes to be accurately counted by the Digital Analyzer. When too many probes are present, the Digital Analyzer is not able to distinguish each and every probe present in the lane. When too few fluorescent species are present, the Digital Analyzer may have difficulty focusing on the lane surface. Therefore, a measurement of mean binding density (spots per square micron) is provided with each lane scanned. The linear range of counting extends from 0.05 to 2.25 spots per square micron for assays run on an nCounter MAX or FLEX system. The range is 0.05 to 1.18 spots per square micron for assays run on the nCounter SPRINT system. (see nSolver User Manual).
# Binding Density (BD) Plots
plotBD(eset = eset, metadata = metadata, y.thresholds = c(opt$bd_min, opt$bd_max),
comparison.key = key.label, legend.label = opt$label)
Check the expression of the positive controls. The positive genes follow the expeted pattern of expresssion.
# Positive Controls
boxplot.expr(eset,is.positive)
Check the expression of the negative controls
# Negative Controls
boxplot.expr(eset,is.negative)
We establish a noise threshold. This threshold is based on the mean and standard deviation of counts of the negative control genes and represents the background noise. We define it as the mean expression of the negative genes counts + 2 times the standard deviation.
# Noise Threshold
lodcounts <- extract.pred(eset, is.negative)
lod <- mean(lodcounts$count) + 2 * sd(lodcounts$count)
Expression of each housekeeping genes in all samples. The line in red represents the noise threshold.
We can observe that all the housekeeping genes have enough expression (above the noise threshold).
# Housekeeping Genes
housekeeping.boxplot<-boxplot.expr(eset,is.housekeeping)
housekeeping.boxplot+geom_hline(yintercept = (lod),colour="red")
We plot the mean expression of all the housekeeping genes in each sample. We can obverve that some sample have overall low expression for the housekeeping genes, but all are above the noise threshold.
# Expression of all the housekeeping genes in each sample
housekeeping.condition.boxplot+geom_hline(yintercept = (lod),colour="red")
Expression of all genes (Endogenous + Housekeeping).
# Expression of multiple condition genes in each sample
endogenous.housekeeping.condition.boxplot+geom_hline(yintercept = (lod),colour="red")
We perform a normalization using the expression of the positive genes. This attempts to normalize for technical noise across the samples
pre.normalization.boxplot + geom_hline(yintercept = lod,colour="red")
post.normalization.boxplot + geom_hline(yintercept = lod,colour="red")
We can see that this normalisation doesn’t make the samples look better.
We select thouse housekeeping genes that have expression values greater than the noise threshold and a mean value of expression of at least 200 counts.
Using those genes, we observe that samples obtain more similar profiles.
housekeeping.post.normalization.boxplot + geom_hline(yintercept = lod,colour="red") + geom_smooth(se=T, aes(group=1))
We’ll drop the genes that are below the LOD in over 80% of the samples:
# SHOW THIS?
We can save the metadata dataframe and the normalised counts matrix for further analysis.
# ALREDY SAVE
A PCA (Principal Component Analysis) performs a transformation over the data in order to obtain orthogonal vectors in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
Plotting the samples in this transformed system show which samples are more similar. Here we scaled and centered the normalized data and then performed PCA to look at how the samples cluster. The samples do not cluster clearly along components 1-3 by their invasive potential, which indicates the samples are more variable than the signal, if the signal exists.
plotPCA(comps, PRINCIPAL.COMPONENT.X, PRINCIPAL.COMPONENT.Y, key.label
, legend.label = opt$label, plot.title = PCA.PLOT.TITLE)
DEA by the DESeq2 library. The ncounts matrix is converted to integers for the analysis. The original counts are preserved in counts(dds).
See this thread for log2FC.
# SHOW THIS?
The results of the differential expression analysis:
dplyr::select(dea.p, type, gene.name, id, baseMean, log2FoldChange, pvalue, padj)
We see the 188 genes sorted by log2FC with significantly altered expression between samples of invasive tumours and samples with non-invasive tumours. We can show the results of the differential expression analysis as a volcano plot. A volcano plot typically plots some measure of effect on the x-axis (typically the fold change) and the statistical significance on the y-axis (typically the -log10 of the p-value). Genes that are highly dysregulated are farther to the left and right sides, while highly significant changes appear higher on the plot.
plot.Volcano(tab, tab2, opt$log_fc, opt$dea_pvalue, plot.title = "Volcano Plot")
We perform a classic enrichment analysis in GO and KEGG for all the genes differentially epressed by NAC.
#res<- getDE.raw(ncounts = round(ncounts), metadata = metadata, design = design) # the core script for the pathway analysis expects a dataframe with the DEA named res
#We need to first pull out the Refseq IDs from the Nanostring IDs and convert those to Entrez IDs. Not all of the genes have Entrez IDs, however, #we lose `r lost.genes` genes through this conversion.
#converted = unlist(lapply(strsplit(res$rowname, "_", fixed = TRUE), "[", 4))
#converted = unlist(lapply(strsplit(converted, ".", fixed = TRUE), "[", 1))
#converted = paste0("NM_", converted)
#res$refseq_mrna = converted
#lost.genes = length(res$rowname) - length((res %>% data.frame() %>% left_join(entrez, by = "refseq_mrna") %>%
# filter(!is.na(entrezgene)) %>% filter(!is.na(log2FoldChange)) %>% filter(!is.na(lfcSE)))$rowname)
#enrich_rs = enrich_cp(res, "Invasiveness", type="all")
#enrich_rs = enrich_rs
#enrich_summary = enrich_rs$summary %>% arrange(p.adjust)
#enrich_summary = convert_enriched_ids(enrich_summary,entrezsymbol = entrezsymbol) %>% arrange(p.adjust)
#write_csv(x = enrich_summary,path = "NAC_enrichment.csv" )
The pathways enriched in genes with altered expression among our samples are presented in a table:
enrich.rs.summary
enrich.rs.over.summary
enrich.rs.under.summary
Finally, we can see the KEGG pathways enriched.
dotplot(enrich.rs$kg, x="count", showCategory=10, colorBy="qvalue", title = DOTPLOT.ENRICHMENT.ALL.TITLE)
# KEGG enrichment in over-expressed genes:
dotplot(enrich.rs.over$kg, x="count", showCategory=10, colorBy="qvalue", title = DOTPLOT.ENRICHMENT.OVER.TITLE)
# KEGG enrichment in under-expressed genes
dotplot(enrich.rs.under$kg, x="count", showCategory=10, colorBy="qvalue", title = DOTPLOT.ENRICHMENT.UNDER.TITLE)
We can map the genes to a KEGG pathway with this function:
# how do u want to view this?
For example, this is how the pathway hsa04151 looks like: